Note: Rather than directly knitting this file, you may be better off running each chunk in succession with the little green triangles, and then knitting it afterwards.
The Starbucks data, provided by Danny Kaplan, contains information about every Starbucks in the world:
Starbucks <- read_csv("https://www.macalester.edu/~ajohns24/Data/Starbucks.csv")
Starbucks includes the Latitude and Longitude of each location. We want to construct a visualization of the relationship between these two. THINK: Which of these should go on the y-axis?
ggplot(data=Starbucks) +
geom_point(aes(x=Longitude,y=Latitude), alpha=0.2)
The point pattern probably looks familiar! To highlight the geographical nature of this scatterplot, we can superimpose the points on top of a map, using the ggmap function in the ggmap library. Note: if you have trouble running this next chunk, you may have to (1) install the devtools package and (2) reinstall ggmap using the following line:
devtools::install_github("dkahle/ggmap")
WorldMap <- get_map(location=c(lon=0,lat=0), source="google", zoom=2)
ggmap(WorldMap) +
geom_point(data=Starbucks, aes(x=Longitude,y=Latitude), alpha=0.2)
We can set the center location and zoom level:
US_map <- get_map(location=c(lon=-100,lat=40), source="google", zoom=3)
ggmap(US_map) +
geom_point(data=Starbucks, aes(x=Longitude,y=Latitude), alpha=0.2)
TC_map <- get_map(location=c(lon=-93.1687,lat=44.9398),source="google")
ggmap(TC_map) +
geom_point(data=Starbucks, aes(x=Longitude,y=Latitude))
Exercise 1.1 (Starbucks Locations) Construct your own map of the Starbucks locations. In particular, experiment with the zoom argument and how it works.

Exercise 1.2 (Nice Ride Stations) Here is a list of Nice Ride stations around the Twin Cities, along with the number of docks at each station in 2016:
Stations <-
read_csv("http://www.macalester.edu/~dshuman1/data/112/Nice_Ride_2016_Station_Locations.csv")
Make a map of the stations with both the color and size of each glyph set according to the number of docks.
The geom_density_2d and stat_density_2d functions are great for plotting distributions over spatial regions. Here is an example that shows the density of Starbucks locations in North America.
US_map <- get_map(location=c(lon=-100,lat=40), source="google", zoom=4)
ggmap(US_map)+
geom_density_2d(data=Starbucks, aes(x=Longitude,y=Latitude),size=.3)+
stat_density_2d(data = Starbucks,
aes(x=Longitude,y=Latitude, fill = ..level.., alpha = ..level..),
size = 0.1, bins = 20, geom = "polygon") +
scale_alpha(guide = FALSE) +
scale_fill_gradient(low = "green", high = "red",
guide = FALSE)
Here is another nice example of a contour plot for tropical storms, with the code posted here.
Geographical data needn’t be expressed by latitude and longitude. For choropleth maps, instead of visualizing our data as points with different aesthetics (size, color, transparency, etc.), we color different regions of the maps based on data values. To do this we need to specify both the geometric regions on which the data resides (counties, states, zip codes, etc.), and then wrangle the data so that there is one value per region.
As an example, we will reconsider the elect data which included county-level election and demographic variables:
elect <- read_csv("https://www.macalester.edu/~ajohns24/data/electionDemographics16.csv")
Here are a few different ways to do this (others exist as well):1
choroplethr
library(choroplethr)
library(choroplethrMaps)
The county_choropleth function requires the variable of interest to be stored as value in the elect data. The following code does this:
#use but don't worry about this syntax
elect <- elect %>% mutate(value=perrep_2016)
Some example maps with the choroplethr package:
county_choropleth(elect)+
scale_fill_manual(values = rev(brewer.pal(7,"RdBu")))
county_choropleth(elect, state_zoom="minnesota")+
scale_fill_manual(values = rev(brewer.pal(7,"RdBu")))
county_choropleth(elect, state_zoom="minnesota", reference_map = TRUE)+
scale_fill_manual(values = rev(brewer.pal(7,"RdBu")))
Exercise 1.3
a. Make a national map of winrep_2016, the indicator of whether or not Trump won each county, and summarize the trends you see. Don’t forget to first define this as your value of interest:
elect <- elect %>% mutate(value=winrep_2016)
b. Construct a map of an elect variable of your choice!
ggplot + geom_map
For an example that plots state data, see the ggplot2 reference page. This example uses the ggcounty package to plot county data. You may need to first install the devtools package, and then install the ggcounty package via the commands
library(devtools)
devtools::install_github("hrbrmstr/ggcounty")
library(ggcounty)
# reformat the FIPS region codes
elect<-elect%>%
mutate(FIPS=ifelse(region<10000,paste("0",as.character(region),sep=""),as.character(region)))
# define appropriate (& nicely labeled) breaks
elect$brk <- cut(elect$perrep_2016,
breaks=seq(0,100,by=10),
labels=c("0-9", "10-19", "20-29", "30-39",
"40-49", "50-59", "60-69", "70-79", "80-89", "90-100"),
include.lowest=TRUE)
# get the US counties map (lower 48)
us <- ggcounty.us()
# start the plot with our base map
gg <- us$g
# add a new geom with our data (choropleth)
gg <- gg + geom_map(data=elect, map=us$map,
aes(map_id=FIPS, fill=brk))
# define nice colors
gg <- gg + scale_fill_manual(values = rev(brewer.pal(10,"RdBu")),name="Percent Republican")
# plot the map
gg
Simple features (sf) package: ggplot + geom_sf
Note: To use the geom_sf command with ggplot, you’ll need to download the development version of ggplot2. You can follow the instructions here. The only update is that the command to install the development version on line 16 should be install_github("hadley/ggplot2").
Here is an example from Matt Strimas-Mackey’s excellent blog post. Use the function st_read to download the shape file for the counties of North Carolina, which is included in the sf package:
library(sf)
library(viridis)  # needed below for scale_fill_viridis
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet = TRUE)
glimpse(nc)
## Observations: 100
## Variables: 15
## $ AREA <dbl> 0.114, 0.061, 0.143, 0.070, 0.153, 0.097, 0.062, 0.0...
## $ PERIMETER <dbl> 1.442, 1.231, 1.630, 2.968, 2.206, 1.670, 1.547, 1.2...
## $ CNTY_ <dbl> 1825, 1827, 1828, 1831, 1832, 1833, 1834, 1835, 1836...
## $ CNTY_ID <dbl> 1825, 1827, 1828, 1831, 1832, 1833, 1834, 1835, 1836...
## $ NAME <fctr> Ashe, Alleghany, Surry, Currituck, Northampton, Her...
## $ FIPS <fctr> 37009, 37005, 37171, 37053, 37131, 37091, 37029, 37...
## $ FIPSNO <dbl> 37009, 37005, 37171, 37053, 37131, 37091, 37029, 370...
## $ CRESS_ID <int> 5, 3, 86, 27, 66, 46, 15, 37, 93, 85, 17, 79, 39, 73...
## $ BIR74 <dbl> 1091, 487, 3188, 508, 1421, 1452, 286, 420, 968, 161...
## $ SID74 <dbl> 1, 0, 5, 1, 9, 7, 0, 0, 4, 1, 2, 16, 4, 4, 4, 18, 3,...
## $ NWBIR74 <dbl> 10, 10, 208, 123, 1066, 954, 115, 254, 748, 160, 550...
## $ BIR79 <dbl> 1364, 542, 3616, 830, 1606, 1838, 350, 594, 1190, 20...
## $ SID79 <dbl> 0, 3, 6, 2, 3, 5, 2, 2, 2, 5, 2, 5, 4, 4, 6, 17, 4, ...
## $ NWBIR79 <dbl> 19, 12, 260, 145, 1197, 1237, 139, 371, 844, 176, 59...
## $ geometry <simple_feature> MULTIPOLYGON (((-81.4727554..., MULTIPOLY...
The nice thing about this method is that the map info is now stored as a data frame (nc), and we will soon be familiar with how to manipulate data frames. Here is a plot of the areas of the counties, which is just one variable in the frame:
ggplot(nc) +
geom_sf(aes(fill = AREA)) +
scale_fill_viridis("Area") +
ggtitle("Area of counties in North Carolina")
An important question when plotting spatial data is how to project data on Earth’s surface onto a 2D plot. You can see a long list of different projection systems here. Here is an example with the Albers equal area projection:2
ggplot(nc) +
geom_sf(aes(fill = AREA)) +
scale_fill_viridis("Area") +
coord_sf(crs = st_crs(102003)) +
ggtitle("Area of counties in North Carolina (Albers)")
We can also place a choropleth map on top of a Google map:3
NCMap<-get_map(location = c(-80.843,35.227),source="google",maptype="roadmap",zoom=6)
ggmap(NCMap) +
geom_sf(data=nc,aes(fill = AREA),inherit.aes = FALSE) +
scale_fill_viridis("Area") +
ggtitle("Area of counties in North Carolina")
The next two methods read in the spatial geometry information as different types of shape files, rather than as data frames as the sf package does. Therefore, I find them a bit more difficult to use, but since the sf package is still in development, some tasks are easier to do with these other approaches.
There are different spatial data structures for storing information such as the boundaries of the regions on which the data reside. For example, a single data point may describe a statistic about a county, and the shape files contain information about points, lines, polygons, or pixels used to describe the boundaries and regions of these counties. Examples of shape file extensions include .shp, .shx, and .dbf. Here is an example of how to load and work with shape files using the sp, rgdal, and maptools packages, and here is a nice open source manuscript on the topic. The packages shapefiles and tigris provide easy access to some shape files, and you can find many others online.
We’ll again start with the map of North Carolina from Google, but we’ll pull the county shapes from the tigris package:
library(tigris)
nc_counties <- counties(state = 'NC',cb = TRUE, resolution = '5m')
Note that nc_counties is not a regular data frame, but a SpatialPolygonsDataFrame, which, suffice it to say, is not as easy to manipulate without some extra practice. Different methods to work with these objects include the tidy function from the broom package (a bit involved), using the combination ggplot + geom_polygon, or using the sp package (often in combination with rgdal). We won’t cover the details of any of these, but you can investigate if you are interested.
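For instance, here is a minimal sketch of the ggplot + geom_polygon route, using ggplot2’s fortify function to flatten the polygons into a plain data frame of vertices. (The GEOID column name is an assumption about the tigris output; check names(nc_counties@data) to confirm.)

```r
library(ggplot2)

# Flatten the SpatialPolygonsDataFrame into a data frame of polygon
# vertices; each row is one vertex, grouped by county (the "id" column)
nc_df <- fortify(nc_counties, region = "GEOID")

# Draw the county boundaries with geom_polygon
ggplot(nc_df, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()
```

To shade the counties by a data value, you would then join your data onto nc_df by the id column before plotting.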
One (relatively) easy way to deal with a SpatialPolygonsDataFrame containing the spatial data structure is with the tmap and tmaptools packages, which do more of the formatting behind the scenes. Here are some nice demos. And here is one way to plot our data with the functions from tmap:
library(tmap)
library(tmaptools)
nc_counties@data$ALAND<-as.numeric(nc_counties@data$ALAND)
tm_shape(nc_counties)+
tm_polygons("ALAND",palette=viridis(n=25),style="cont",title="Area of counties in North Carolina")
Personally, I prefer the new geom_sf method described above, since the map is stored as a data frame and we can do dplyr operations on it (like grouping counties, filtering out certain locations, etc., with those operations applying to both the data and the underlying geometries).
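For example, here is the sort of dplyr manipulation that works directly on the nc data frame from above (the AREA cutoff is arbitrary, chosen just for illustration):

```r
library(sf)
library(dplyr)
library(ggplot2)

# Filter to the larger counties; the geometry column comes along automatically
big_nc <- nc %>%
  filter(AREA > 0.15)

# The result is still an sf data frame, so geom_sf works as before
ggplot(big_nc) +
  geom_sf(aes(fill = AREA)) +
  ggtitle("Larger counties in North Carolina")
```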
leaflet
We may return to cover more about interactivity later in the semester, but it is now easy to add interactivity to spatial data visualizations with the leaflet package. Let’s try it out with our last graph:
library(leaflet)
tmap_mode("view")
last_map()
You can now zoom in and out, and also click on individual counties to find the land area values. You can read more about leaflet here. This is a really nice demonstration of interactive choropleths with leaflet.
Here is another example with the Starbucks data:
leaflet(Starbucks)%>%
setView(-93.1687,44.9398,zoom = 11)%>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lat=~Latitude,lng=~Longitude, popup=~paste(City,": ",`Store Name`,sep=""))
Now that we’ve learned the basics of constructing visualizations, let’s consider using visualizations to tell a story. Here are some examples:
In January 2017, fivethirtyeight.com published an article on hate crime rates across the US. A tidied-up version of their data is available at https://www.macalester.edu/~ajohns24/data/hate_crimes_extra.csv
Load the data and store it as US_crime.
What are the units of observation and how many observations are there?
The US_crime data were generated from the hate_crimes data set in the fivethirtyeight package. Examine the codebook for the hate_crimes data. Note that US_crime contains these same variables plus 3 more:
- crimes_pre = average daily hate crimes per 100,000 population (2010-2015)
- crimes_post = average daily hate crimes per 100,000 population (November 9-18, 2016)
- crimes_diff = difference in the average daily hate crimes per 100,000 population after the election vs before the election (crimes_post - crimes_pre)
- trump_win = an indicator of whether Trump won

In comparing hate crime rates before and after the election, why is it better to examine crimes_pre vs crimes_post than avg_hatecrimes_per_100k_fbi vs hate_crimes_per_100k_splc? What other issues should we note about these data sources?
Explain why, if we want to study possible connections between hate crime rates and the election, we should use crimes_diff instead of crimes_post in our analysis.
Construct and discuss visualizations of the following:
- crimes_diff (be sure to note how these values compare to 0 and the contextual significance of this)
- crimes_diff vs trump_win
- crimes_diff vs share_vote_trump
- crimes_diff vs gini_index and trump_win

US_crime <- read_csv("https://www.macalester.edu/~ajohns24/data/hate_crimes_extra.csv")
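As a quick sanity check on the variable definitions above, you could recompute the post-election minus pre-election difference yourself and compare it to the stored crimes_diff (purely illustrative, since the variable is already in the data):

```r
library(dplyr)

# Recompute the post - pre difference by hand and place it next to
# the crimes_diff column provided in the data
US_crime %>%
  mutate(crimes_diff_check = crimes_post - crimes_pre) %>%
  select(state, crimes_pre, crimes_post, crimes_diff, crimes_diff_check)
```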
#make a copy of US_crime as a plain data frame (tibbles from read_csv don't support row names)
US_crime_new <- as.data.frame(US_crime)
#treat states as row names
row.names(US_crime_new) <- US_crime_new$state
#take out some variables
US_crime_new <- select(US_crime_new, -c(state,median_house_inc,hate_crimes_per_100k_splc,avg_hatecrimes_per_100k_fbi,crimes_pre,crimes_post,trump_win))
#store as a matrix
crime_mat <- data.matrix(US_crime_new)
Check the dimensions of the new data table:
dim(US_crime_new)
## [1] 51 9
heatmap(crime_mat, Colv=NA, scale="column", col=cm.colors(256))
stars(crime_mat, flip.labels=FALSE, key.loc=c(15,1.5), draw.segments=TRUE)
The jotw data contain the first 30,000+ posts to the Jam of the Week Facebook group:
jotw <- read_csv("https://www.macalester.edu/~ajohns24/data/jam_of_the_week.csv")
Each row represents a single jam posted to the group. Variables include:
- gender = gender of the musician in the post
- num_reactions, num_comments, etc. = the number of reactions to, comments on, etc., the post
- year_day, week_day, hour = indicators of when the jam was posted
Tell a story about Jam of the Week. In doing so, consider the following research question as a prompt: To what extent does online behavior adhere to or transcend existing biases related to gender?
This demo gives examples of most of these different options.↩
Note: crs is short for “coordinate reference system.”↩
Note: Despite much effort, I have not yet figured out why the two maps don’t align perfectly. It has something to do with the longlat projections and bounding boxes used by ggmap (see here).↩